Abstract. A search technique locating network modules, i.e., internally densely
connected groups of nodes in directed networks, is introduced by extending the Clique
Percolation Method originally proposed for undirected networks. After giving a
suitable definition for directed modules we investigate their percolation transition
in the Erdős-Rényi graph both analytically and numerically. We also analyse four
real-world directed networks, including Google's own web-pages, an email network,
a word association graph and the transcriptional regulatory network of the yeast
Saccharomyces cerevisiae. The obtained directed modules are validated by additional
information available for the nodes. We find that directed modules of real-world graphs
inherently overlap and the investigated networks can be classified into two major groups
in terms of the overlaps between the modules. Accordingly, in the word-association
network and Google's web-pages overlaps are likely to contain in-hubs, whereas the
modules in the email and transcriptional regulatory networks tend to overlap via
out-hubs.
1. Introduction

A widespread approach to the analysis of complex natural, social and technological
phenomena is to assemble the participating molecules, individuals or electronic devices
and their interactions into a network (nodes and links) and to infer functional
characteristics of the entire system from this static web of connections [1, 2]. This
approach is rooted in, among others, statistical physics, where often the thermodynamic
limit (N → ∞, where N is the number of nodes) is considered, and the overall
(large-scale) structure of connections is studied rather than the details at the level
of nodes and links. Accordingly, over the past few years, several broadly studied
large-scale properties of real-world webs have been uncovered, e.g., a low average distance
combined with a high average clustering coefficient [3], the broad (scale-free) distribution
of node degree (number of connections of a node) [4, 5, 6, 7] and various signatures
of hierarchical/modular organisation [8, 9]. In addition, detailed analyses of the
small-scale behaviour of the same complex webs have revealed overrepresented local structures:
graph motifs [10, 11], i.e., small groups of nodes (typically of size 3-5) with specifically
arranged connections among them. The identified small- and large-scale properties are
both closely related to the dynamical behaviour of the corresponding complex system.
Nodes with many connections (hubs) often have a central role in traffic [12], while motifs
act as building blocks performing distinct basic information processing tasks [13].

The intermediate-scale substructures in networks (units larger than motifs),
made up of vertices more densely connected to each other than to the rest of the 
network, are often referred to as communities, modules, clusters or cohesive groups 
[HI I5l HU [T71 [291 HH EU [19] with no widely accepted, unique definition. In the various 
types of networks these groups can represent, e.g. , communities of people [Ml EH l2"T] . 
functional units in biology [8, 22] and sets of tightly coupled stocks or industrial sectors
in the economy [23]. A reliable method to pinpoint network modules has many potential
industrial applications, e.g., it can help service providers (phone, banking, web, etc.)
identify meaningful groups of customers (users), or support biomedical researchers in
their search for individual target molecules and novel protein complex targets [24, 25]. In
addition, modules, and also some small subgraphs, are appropriate for "coarse-graining"
complex networks: each module/subgraph can be represented as a node and two such
nodes can be linked if the corresponding modules/subgraphs are connected (or overlap)

[IHJEHIET]. 

The key requirements towards network module search techniques [19, 28, 30] are
that they should be local, based on link density, and error-tolerant (the removal or
insertion of a link may alter only nearby modules). Furthermore, as dense groups in
real-world graphs often overlap with each other, the module finding methods should
allow overlaps between the groups. For example, in a social web each person belongs
to several groups (family, colleagues and friends), in a protein interaction network
each protein participates in multiple complexes [32] and a large portion of
web-pages is classified under multiple categories [33]. Prohibiting overlaps during module

Figure 1. Illustration of the Clique Percolation Method (CPM) [19, 34] with k-clique
template rolling in a small undirected graph for k = 4. Initially the template is placed
on A-B-C-D (left panel) and it is "rolled" onto the subgraph A-C-D-E (middle panel).
The position of the k-clique template is marked with thick black lines and black nodes,
whereas the already visited links are represented by thick gray lines and dark gray
nodes. Observe that in each step only one of the nodes is moved and the two 4-cliques
(before and after rolling) share k - 1 = 3 nodes. At the final step (right panel) the
template reaches the subgraph C-D-E-F, and the set of nodes visited during the process
(A-B-C-D-E-F) is considered a module identified by the CPM at k = 4.

identification strongly increases the percentage of false negative co-classified pairs. As 
an example, in a social web a group of colleagues might end up in different modules, each 
corresponding to their families, and, in this case, the network module corresponding to 
their work unit is bound to become lost. 

A recent link-density based approach to module finding, fulfilling the above
requirements, is provided by the Clique Percolation Method (CPM) [19, 34]. In this
approach, the definition of the modules is based on k-cliques (complete subgraphs of
size k in which each node is connected to every other node). A k-clique is a sub-graph
with maximal possible link density, therefore it is a good starting point for defining
modules. However, a method accepting only complete sub-graphs as modules would
be too restrictive. Therefore, k-cliques are "loosened up" in the following way. Two
k-cliques are said to be adjacent if they share k - 1 nodes (or in other words, if they differ
only in a single node), and a module is defined as the union of k-cliques that can be
reached from each other through a series of adjacent k-cliques. Such modules can be
best visualised with the help of a k-clique template (an object isomorphic to a complete
graph of k vertices). Such a template can be placed onto any k-clique in the graph,
and rolled to an adjacent k-clique by relocating one of its vertices and keeping its other
k - 1 vertices fixed. Thus, the k-clique modules (k-clique communities) of a graph are
all those subgraphs that can be fully explored by rolling a k-clique template in them,
but cannot be left by this template, as illustrated in Fig. 1. The algorithm used for the
implementation of this technique is very efficient for most real networks, and provides
the full list of overlapping modules in a short amount of time [19, 35].
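
The module definition above can be illustrated with a minimal brute-force sketch. This is not the efficient algorithm of [19, 35]; it simply enumerates every k-clique and merges adjacent ones, which is only feasible for small graphs. The function name `k_clique_modules` and the edge list `FIG1_EDGES` are hypothetical, and the edge list assumes that the graph of Fig. 1 consists of exactly the three 4-cliques named in its caption.

```python
from itertools import combinations


def k_clique_modules(edges, k):
    """Return the k-clique modules (sets of nodes) of an undirected graph."""
    # Build an adjacency map from the undirected edge list.
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    # Brute-force enumeration of all k-cliques (small graphs only).
    cliques = [
        frozenset(group)
        for group in combinations(sorted(neighbours), k)
        if all(b in neighbours[a] for a, b in combinations(group, 2))
    ]

    # Union-find over cliques: adjacent cliques (sharing k - 1 nodes) are merged.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # A module is the union of the nodes of all cliques in one merged group.
    modules = {}
    for i, clique in enumerate(cliques):
        modules.setdefault(find(i), set()).update(clique)
    return list(modules.values())


# Assumed edge list for Fig. 1: the 4-cliques A-B-C-D, A-C-D-E and C-D-E-F.
FIG1_EDGES = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"),
    ("A", "E"), ("C", "E"), ("D", "E"), ("C", "F"), ("D", "F"), ("E", "F"),
]
print(k_clique_modules(FIG1_EDGES, 4))
```

Because modules are unions of cliques rather than a partition of the nodes, two modules found this way may share nodes, for instance two triangles joined at a single node at k = 3.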

A common shortcoming of current module finding methods is that they ignore the
possible directionality of the links during the analysis of a network. The direction of a
single link in most real networks signals either the direction of some kind of flow (e.g., the
flow of information or energy), or the asymmetry of the relation between the nodes (e.g.,
a superior-inferior relation). Consequently, nodes possessing mostly incoming links are
expected to play a very different role in the network (or within the modules they belong
to) from those possessing mostly outgoing links or from those having a similar amount
of both kinds of links. Therefore, as a first attempt to take into consideration the
directionality of links, we propose a simple measure for the nodes within the modules
to characterise their roles in terms of the numbers of their incoming and outgoing links.

At the same time the consideration of directionality in modules raises the question
of whether a module searching algorithm that inherently takes into account the
directionality of links is more suitable for directed networks than the original undirected
algorithms. Following this idea, we define the notion of directed k-cliques (in which the
configuration of the directed links has to meet certain criteria), and propose a restricted
version of the CPM (denoted as CPMd), in which only directed k-cliques can be used for the
identification of modules. We apply this method to several networks: first, we examine
the percolation transition of the directed k-cliques in the Erdős-Rényi (ER) random
graph [36], then move on to study the directed modular structure of four real-world
networks: a word-association network, Google's web-pages, an email network,
and the transcriptional regulatory graph of yeast. The identified directed modules are
verified with the help of additional information (protein functional annotations,
web-page names, and word usage frequencies) about the nodes.

2. Definitions 

In undirected graphs a pair of nodes is either connected or not, whereas in a directed
graph the same pair, (A,B), can be connected in three ways: either by a "single link" as
(i) A→B and (ii) A←B or by a "double link" as (iii) A↔B. Multiple links (i.e., more
than one link between A and B in the same direction) and self-links (such as A→A)
are not allowed. In the following we first define a simple measure for comparing nodes
within a module based on the directionality of their links, then introduce the concept
of directed k-cliques, the fundamental objects of our directed module finding approach.

2.1. Comparing the nodes according to their relative out-degree 

A natural and simple approach to relate nodes in a module to each other is to compare
the number of their incoming and outgoing links connected to other members in the
module. For example, a node having only out-neighbours amongst the members of the
module can be viewed as a "source" or a "top-node", whereas a node with only incoming
links from these members is a "drain" or a "bottom-node". Most nodes, however, fall
somewhere between these two extremes. To quantify this property, we introduce the
relative in-degree and relative out-degree of node i in module α as

$$D^{\alpha}_{i,\mathrm{in}} = \frac{d^{\alpha}_{i,\mathrm{in}}}{d^{\alpha}_{i,\mathrm{in}} + d^{\alpha}_{i,\mathrm{out}}}, \tag{1a}$$

$$D^{\alpha}_{i,\mathrm{out}} = \frac{d^{\alpha}_{i,\mathrm{out}}}{d^{\alpha}_{i,\mathrm{in}} + d^{\alpha}_{i,\mathrm{out}}}, \tag{1b}$$

where $d^{\alpha}_{i,\mathrm{in}}$ and $d^{\alpha}_{i,\mathrm{out}}$ denote the number of in-neighbours and out-neighbours amongst
the other nodes in the module, respectively. Obviously the values of both $D^{\alpha}_{i,\mathrm{out}}$ and $D^{\alpha}_{i,\mathrm{in}}$
are in the range between 0 and 1, and the relation $D^{\alpha}_{i,\mathrm{in}} + D^{\alpha}_{i,\mathrm{out}} = 1$ holds. For weighted
networks, (1a)-(1b) can be replaced by the relative in-strength and relative out-strength
defined as

$$W^{\alpha}_{i,\mathrm{in}} = \frac{w^{\alpha}_{i,\mathrm{in}}}{w^{\alpha}_{i,\mathrm{in}} + w^{\alpha}_{i,\mathrm{out}}}, \tag{2a}$$

$$W^{\alpha}_{i,\mathrm{out}} = \frac{w^{\alpha}_{i,\mathrm{out}}}{w^{\alpha}_{i,\mathrm{in}} + w^{\alpha}_{i,\mathrm{out}}}, \tag{2b}$$

where $w^{\alpha}_{i,\mathrm{out}}$ and $w^{\alpha}_{i,\mathrm{in}}$ denote the aggregated weight of outgoing and incoming
connections with other members in the module α.
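
As an illustration of (1b) and (2b), the following minimal sketch computes the relative out-strength of a node within one module. The function name `relative_out_strength` and the representation of the network as a dictionary from directed pairs to weights are assumptions made for this example only.

```python
def relative_out_strength(node, module, links):
    """Relative out-strength of `node` inside `module`, following Eq. (2b).

    `links` maps a directed pair (source, target) to a positive weight.
    With all weights equal to 1 the result is the relative out-degree (1b).
    Returns None if the node has no links to other members of the module.
    """
    others = set(module) - {node}
    # Aggregate only the links that connect the node to other module members.
    w_out = sum(links.get((node, other), 0.0) for other in others)
    w_in = sum(links.get((other, node), 0.0) for other in others)
    total = w_in + w_out
    return w_out / total if total > 0 else None


# A three-node module ordered as A -> B -> C with the extra link A -> C.
example_links = {("A", "B"): 1.0, ("A", "C"): 1.0, ("B", "C"): 1.0}
for name in "ABC":
    print(name, relative_out_strength(name, {"A", "B", "C"}, example_links))
```

In this example A is a "source" (value 1), C is a "drain" (value 0) and B lies halfway between the two extremes.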



2.2. Directed k-cliques and the directed Clique Percolation Method (CPMd) 

In a complete sub-graph of size k the k(k - 1)/2 links can be directed in $3^{k(k-1)/2}$
ways. Since the undirected CPM treats these alternatives as identical, introducing link
directions allows a large variety of possible rules for defining directed modules. A natural
concept, however, is to aim for "directed modules" preserving some kind of directedness
as a whole, rather than just being a collection of nodes connected by directed edges.

Therefore, we replace the k-cliques (the fundamental objects of the CPM) by
directed k-cliques, which are defined as complete sub-graphs of size k in which an ordering
can be made such that between any pair of nodes there is a directed link pointing from
the node with the higher order towards the lower one. Since the presence of double links
usually leads to multiple possibilities to order the nodes in a way fulfilling the above
requirement, for simplicity we first concentrate on directed k-cliques with no double
links. In this case, the higher the order of a node, the more out-neighbours it has in
the k-clique (see illustration in Fig. 2a). Thus, the restricted out-degree of a node in the
k-clique (the number of its out-neighbours in the k-clique, ranging from 0 to k - 1) can
be assigned as its order. From this, it can be seen easily (for details see Appendix A)
that the condition for a k-clique with no double links to qualify as a directed k-clique
is equivalent to the following three conditions:

(i) Any directed link in the k-clique points from a node with a higher order (larger
restricted out-degree) to a node with a lower order.

(ii) The k-clique contains no directed loops (where a "directed loop" is a closed directed
path).

(iii) The restricted out-degree of each node in the k-clique is different.

The overall directionality of such an object naturally follows the ordering of the nodes: 
the node with highest order is the one which has only out-neighbours, and can be viewed 

Figure 2. Groups of nodes forming a directed k-clique (a, c) and groups (b, d)
that do not. (a) A directed k-clique without double links. The index of each node
corresponds to its order (which is equivalent to the number of its out-links) within the
directed k-clique. (b) A complete sub-graph without double links, but not accepted as
a directed k-clique, because it contains a directed loop. (c) A directed k-clique with a
double link. Note that the order of the nodes depends on which link is deleted from the
double link. (d) Double link in a complete sub-graph that is not a directed k-clique.
It is not possible to remove a link from the double link in a way that all directed loops
disappear.

as the "source" or "top" node of the k-clique, whereas the node with lowest order has
only incoming links from the others, and corresponds to a "drain" or "bottom" node.

None of the above three conditions holds in the presence of double links: directed
loops appear in the k-clique, the restricted out-degree of at least two nodes in the
k-clique becomes the same (see Appendix A), and we can find directed links pointing in
the direction of increasing order. However, based on the ordering of the nodes, it is
always possible to eliminate the double links (by removing all links that point towards
higher order) from a directed k-clique in such a way that the remaining single links fulfil
all three conditions. See Fig. 2c for an example.
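
For k-cliques without double links, the equivalence of conditions (ii) and (iii) can be checked by brute force for small k. The sketch below is an illustration under stated assumptions, not part of the CPMd algorithm: all function names are hypothetical, nodes are labelled 0, ..., k - 1, and the loop test relies on the standard fact that in a complete graph with single links any directed loop implies a directed loop of length three.

```python
from itertools import combinations, permutations, product


def orientations(k):
    """Yield every single-link orientation of the complete graph on k nodes."""
    pairs = list(combinations(range(k), 2))
    for flips in product((False, True), repeat=len(pairs)):
        yield {(b, a) if flip else (a, b) for (a, b), flip in zip(pairs, flips)}


def has_distinct_out_degrees(links, k):
    """Condition (iii): all restricted out-degrees are different."""
    out_degree = [0] * k
    for source, _ in links:
        out_degree[source] += 1
    return len(set(out_degree)) == k


def has_directed_loop(links, k):
    """Negation of condition (ii), tested on all directed triangles."""
    return any(
        {(a, b), (b, c), (c, a)} <= links
        for a, b, c in permutations(range(k), 3)
    )


def count_directed_k_cliques(k):
    """Count the orientations that qualify as a directed k-clique."""
    count = 0
    for links in orientations(k):
        distinct = has_distinct_out_degrees(links, k)
        # Conditions (ii) and (iii) must agree for every orientation.
        assert distinct == (not has_directed_loop(links, k))
        count += distinct
    return count


print(count_directed_k_cliques(4))
```

Each qualifying orientation corresponds to one ordering of the k labelled nodes, so for k = 4 exactly 4! = 24 of the 2^6 = 64 single-link orientations are directed k-cliques.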

The k-clique adjacency is defined similarly to the undirected case: two directed
k-cliques are adjacent if they share k - 1 nodes. The directed k-clique modules (the
CPMd modules) arise as the union of directed k-cliques that can be reached from each
other through a series of adjacent directed k-cliques. The k-clique template rolling picture can
be applied to illustrate the CPMd modules in the same fashion as in the undirected
case. The searching algorithm locating the CPMd modules is described in Appendix B.
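
The definitions of this section can be summarised in a short brute-force sketch. It is not the search algorithm of Appendix B; it only illustrates the definition on small graphs. The names `is_directed_k_clique` and `directed_modules` are hypothetical. The test for a valid ordering uses the observation that a single link forces its source above its target, while a double link imposes no constraint, so an ordering exists exactly when the single links contain no directed loop.

```python
from itertools import combinations


def is_directed_k_clique(nodes, links):
    """True if `nodes` form a directed k-clique in the set of directed `links`."""
    above = {n: set() for n in nodes}  # above[v]: nodes forced above v
    for u, v in combinations(nodes, 2):
        uv, vu = (u, v) in links, (v, u) in links
        if not (uv or vu):
            return False  # not a complete sub-graph
        if uv and not vu:
            above[v].add(u)
        elif vu and not uv:
            above[u].add(v)
    # Repeatedly peel off a node with nothing left above it (a "top" node).
    remaining = set(nodes)
    while remaining:
        top = next((n for n in remaining if not (above[n] & remaining)), None)
        if top is None:
            return False  # a directed loop of single links remains
        remaining.remove(top)
    return True


def directed_modules(links, k):
    """Brute-force CPMd sketch: merge directed k-cliques sharing k - 1 nodes."""
    links = set(links)
    nodes = sorted({n for link in links for n in link})
    cliques = [frozenset(group) for group in combinations(nodes, k)
               if is_directed_k_clique(group, links)]
    modules, unvisited = [], set(range(len(cliques)))
    while unvisited:
        # "Roll the template" from one clique to all cliques reachable from it.
        stack, module = [unvisited.pop()], set()
        while stack:
            i = stack.pop()
            module |= cliques[i]
            adjacent = [j for j in unvisited
                        if len(cliques[i] & cliques[j]) == k - 1]
            unvisited.difference_update(adjacent)
            stack.extend(adjacent)
        modules.append(module)
    return modules


# A directed loop 0 -> 1 -> 2 -> 0 is rejected, but adding the link 0 -> 2
# turns the pair (0, 2) into a double link and the loop can be eliminated.
print(directed_modules({(0, 1), (1, 2), (2, 0)}, 3))
print(directed_modules({(0, 1), (1, 2), (2, 0), (0, 2)}, 3))
```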
We note that the above definition of a directed k-clique is just one possibility among
many others. Natural choices that also impose some kind of directionality on the k-clique
include, e.g., the requirement that at least one of the nodes should have out-links (or
in-links) towards (from) all the other k - 1 nodes, or the requirement that the nodes could
be divided into two non-empty sets such that each node in the first set has an out-link
towards each node in the second set (resembling directed hyper-edges). Our particular
choice was motivated, on the one hand, by the fact that it is more restrictive than the
others (providing a more specific tool to investigate the effects of directionality) and, on
the other hand, by our finding that for most real-world networks even such a restricted
definition results in directed modules that are notably similar to the undirected ones
(see Sec. 4).

3. Percolation transition in the directed ER graph 

The concept of (undirected) random graphs was introduced by Erdős and Rényi [36] in
the 1950s in a simple model consisting of N nodes and connecting every pair of nodes
independently with the same probability p. Even though real networks differ from this
simple model in many aspects, the ER graph still remains of great interest, since such a
graph can serve both as a test bed for checking all sorts of new ideas concerning complex
networks in general, and as a prototype of random graphs to which all other random
graphs can be compared.

Perhaps the most conspicuous early result on the ER graphs was related to the
percolation transition taking place at p = 1/N. The appearance of a giant component in
a network, which is also referred to as the percolating component, results in a dramatic
change in the overall topological features of the graph and has been in the centre of
interest for other networks as well. In a more general framework, one can also address
the question of k-clique percolation in the ER graph. Simple theoretical arguments as
well as numerical simulations [34] show that the critical linking probability of k-clique
percolation is $p_c^{\mathrm{undir}} = [(k-1)N]^{-1/(k-1)}$. In this section we carry out a similar analysis
concerning the percolation transition of directed k-cliques in the directed ER graph.

3.1. Derivation of the critical point 

The directed equivalent of the ER graph consists of N nodes providing N(N - 1)
possible "places" for the directed links, and these are filled independently with uniform
probability p, producing on average M ≈ N(N - 1)p edges. (Note that in the original
undirected ER graph there are only N(N - 1)/2 possibilities to introduce an edge,
therefore, at linking probability p, there are only M ≈ N(N - 1)p/2 connections.) The
critical linking probability $p_c$ decreases with increasing N, and converges to zero as
N → ∞. We restrict ourselves to the large N limit, and evaluate $p_c$ to leading order only.
Let us suppose that we approach the critical point from below: the directed k-cliques do
not yet assemble into a giant module, we can find only small, isolated modules, and the
system is dispersed. In terms of our k-clique template rolling picture this means that
when trying to explore the directed percolation clusters by rolling such a template on
them, we must stop the rolling after a few steps as we run out of unexplored adjacent
directed k-cliques.
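
A directed ER graph as described above can be generated in a few lines. The following is a minimal sketch; the function name `directed_er_graph` and the `seed` parameter are choices made for this example.

```python
import random


def directed_er_graph(n, p, seed=None):
    """Directed ER graph as a set of (source, target) pairs.

    Each of the n(n - 1) ordered pairs of distinct nodes is linked
    independently with probability p, so double links can occur.
    """
    rng = random.Random(seed)
    return {
        (source, target)
        for source in range(n)
        for target in range(n)
        if source != target and rng.random() < p
    }


n, p = 200, 0.1
graph = directed_er_graph(n, p, seed=1)
# The number of links fluctuates around the expected value N(N - 1)p.
print(len(graph), n * (n - 1) * p)
```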

One can estimate $p_c$ from the condition that at the critical point the average number
of yet unexplored directed k-cliques adjacent to the k-clique we have just reached
becomes equal to one. (This makes it possible to roll our template on and on for a
long time.) Since we are going to evaluate $p_c$ to leading order only, we can neglect the
possibility to roll our k-clique template using double edges between the same nodes:
when reaching a directed k-clique, the minimal number of further edges that must be
present to enable the continuation of the template rolling is k - 1. The probability of
such a case is therefore proportional to $p^{k-1}$. Even though it is not forbidden in the
first place to continue using double edges as well, each double edge in the new directed
k-clique we are going to roll onto multiplies the probability by p. In other words, the
probability to roll further to a k-clique containing one double edge is smaller by a factor
of p, the probability to roll further to a k-clique containing two double edges is smaller
by a factor of $p^2$, etc.

During the branching process exploring a directed k-clique percolation cluster, at
the point when we are about to roll our template further on, we can choose the next
node for relocation in k - 1 different ways, which can then be relocated to approximately
N places. If there were no restrictions for the directioning of the links inside a directed
k-clique, then the k - 1 new links connecting the new node to the k - 1 shared nodes
could be directed in $2^{k-1}$ ways. However, the new directed k-clique has to fulfil the
three conditions detailed in Section 2.2 as well, therefore the actual number of allowed
configurations is much smaller. The rank of the new node in the new directed k-clique
can be chosen in k ways: the k - 1 nodes shared with the previous k-clique are already
ordered, and we can "insert" the new node to any place in this hierarchy. By fixing the
order of the new node we fix the direction of the new links as well, therefore we can
allow only k different configurations for the directionality of these links. By combining
these factors together, the condition for reaching the critical point of the percolation
transition can be written as

$$p_c^{k-1} N (k-1) k = 1, \tag{3}$$

from which we gain

$$p_c^{\mathrm{theor}} = [N k (k-1)]^{-1/(k-1)} = \frac{p_c^{\mathrm{undir}}}{k^{1/(k-1)}} \tag{4}$$

for the theoretical prediction of the critical edge probability. Note that in the limiting
case of k = 2 (the directed edge percolation), the $p_c^{\mathrm{theor}} = p_c^{\mathrm{undir}}/2$ relation holds, which
is consistent with the 2:1 ratio for the number of links in the directed and undirected
ER graphs, respectively.
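
The two critical probabilities can be compared numerically. The sketch below evaluates the undirected result of [34] and the leading-order directed prediction (4); the function names are hypothetical and the values N = 1000, k = 4 are arbitrary example inputs.

```python
def p_c_undirected(n, k):
    """Critical linking probability of undirected k-clique percolation [34]."""
    return ((k - 1) * n) ** (-1.0 / (k - 1))


def p_c_directed(n, k):
    """Leading-order prediction (4) for directed k-clique percolation."""
    return (n * k * (k - 1)) ** (-1.0 / (k - 1))


n, k = 1000, 4
p = p_c_directed(n, k)
# Condition (3): the expected number of new adjacent directed k-cliques is one.
print(p ** (k - 1) * n * (k - 1) * k)
# The ratio of the two critical points equals k^(1/(k-1)); for k = 2 it is 2.
print(p_c_undirected(n, k) / p, k ** (1.0 / (k - 1)))
```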
3.2. Numerical simulations

There are two plausible choices to measure the size of the largest directed k-clique
percolation cluster. The most natural one, which we denote by $N^*$, is the number of
nodes belonging to this cluster. We can also define an order parameter associated with
this choice as the relative size of this cluster:

$$\Phi = N^*/N. \tag{5}$$

The other choice is the number $\mathcal{N}^*$ of directed k-cliques of the largest directed k-clique
percolation cluster. The associated order parameter is again the relative size of this
cluster:

$$\Psi = \mathcal{N}^*/\mathcal{N}, \tag{6}$$

where $\mathcal{N}$ denotes the total number of directed k-cliques in the graph. In Figs. 3a-b we
display Φ and Ψ as functions of $p/p_c^{\mathrm{theor}}$, where the directed k-clique size is k = 4, and
the system size varies between N = 50 and N = 1600. The order parameter Φ converges
to a step function for increasing system sizes, whereas Ψ converges to a limit function
(which is 0 for $p/p_c(k) < 1$ and grows continuously to 1 above $p/p_c(k) = 1$). We have
evaluated the transition point numerically as well, by computing the second moment of
the distribution of the $\mathcal{N}_i$ values, excluding the largest one, $\mathcal{N}_i = \mathcal{N}^*$. We denote this
second moment by χ (Eq. 7).

Note that this quantity is analogous to the percolation susceptibility. Both below and
above the transition point the $\mathcal{N}_i$ (i > 1) values follow an exponential distribution, and
only at $p_c$ do they have a power-law distribution. Thus, χ is maximal at the numerical
transition point, $p_c^{\mathrm{num}}$. In Fig. 3c we show χ calculated for the curves shown in Fig. 3b,
as a function of $p/p_c^{\mathrm{theor}}$. In order to check the theoretical prediction for the critical
point obtained in (4) we have carried out a finite-size scaling analysis of the numerical
results. In Fig. 3d we show the ratio $p_c^{\mathrm{num}}/p_c^{\mathrm{theor}}$ as a function of 1/N. Indeed, for large
systems, the above ratio converges to one roughly as $1 + cN^{-1/2}$.
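
Given the list of percolation clusters found in one simulated graph, the quantities discussed above can be evaluated as in the following minimal sketch. The function name `order_parameters` and the input format (one pair of node set and directed k-clique count per cluster) are assumptions of this example. The susceptibility-like quantity is computed here as a plain, unnormalised sum of squares over all clusters except one largest; any normalisation is left out, which is an assumption rather than the exact form of Eq. (7).

```python
def order_parameters(clusters, n_nodes):
    """Return (phi, psi, chi) for a list of percolation clusters.

    `clusters` holds one (node_set, clique_count) pair per cluster and
    `n_nodes` is the number of nodes N in the whole graph.
    """
    if not clusters:
        return 0.0, 0.0, 0.0
    node_counts = [len(nodes) for nodes, _ in clusters]
    clique_counts = [count for _, count in clusters]
    # Phi: relative size of the largest cluster measured in nodes.
    phi = max(node_counts) / n_nodes
    # Psi: relative size of the largest cluster measured in directed k-cliques.
    psi = max(clique_counts) / sum(clique_counts)
    # Chi: unnormalised second moment of the clique counts,
    # with one largest cluster excluded (assumed form, see lead-in).
    rest = sorted(clique_counts)[:-1]
    chi = float(sum(count * count for count in rest))
    return phi, psi, chi


example = [({0, 1, 2, 3, 4}, 2), ({5, 6, 7, 8}, 1), ({9, 10, 11, 12}, 1)]
print(order_parameters(example, n_nodes=20))
```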

4. Results for real-world graphs 

In this section we study the directed modular structure of four real-world networks
ranging from a word association graph through Google's web-pages to email and
transcription regulatory networks. When applied to real networks, the CPMd method
has two parameters: the k-clique size k, and (if the network is weighted) a weight
threshold w* (links weaker than w* are ignored). Changing the threshold is like changing
the resolution (as in a microscope) with which the modular structure is investigated:
by increasing w* the modules start to shrink and fall apart. A very similar effect
can be observed by changing the value of k as well: increasing k makes the modules
smaller and more isolated from each other, but at the same time, each module becomes
more cohesive. When we are interested in the modular structure around a particular






Figure 3. Numerical results for directed k-clique percolation in ER graphs. In each
sub-figure, points show an average over 4 to 100 simulations depending on system size.
a) The order parameter Φ (the number of nodes in the largest percolation cluster
divided by N) as a function of $p/p_c^{\mathrm{theor}}$, where $p_c^{\mathrm{theor}}$ was obtained from Eq. (4). b)
The order parameter Ψ (the number of directed k-cliques in the largest percolation
cluster divided by the total number of directed k-cliques) as a function of $p/p_c^{\mathrm{theor}}$.
c) The numerically determined value for the critical linking probability, $p_c^{\mathrm{num}}$, defined
as the average location of the maximum of χ(p), playing the role of the normalised
percolation susceptibility (see Eq. (7)). d) Verification of the theoretical prediction for
the critical point. The $p_c^{\mathrm{num}}/p_c^{\mathrm{theor}}$ ratio converges to one for large N.



node, it is advisable to scan through some ranges of k and w*, and monitor how the 
obtained modules change. Meanwhile, when analysing the modular structure of the 
entire network, the criterion used to fix these parameters is based on finding a modular 
structure as highly structured as possible [19]. This can be achieved by tuning the 
parameters just below the critical point of the percolation transition. In this way we 
ensure that we find as many modules as possible, without the negative effect of having 
a giant module that would smear out the details of the modular structure by merging 
(and making invisible) many smaller modules. The technical details of the extraction 
of the directed k-clique modules are described in Appendix B.
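
The role of the weight threshold w* can be illustrated with a short sketch. The function names and the dictionary representation of the weighted links are assumptions of this example; links weaker than w* are dropped, and scanning a few thresholds shows how quickly the network thins out before any module search is run.

```python
def apply_weight_threshold(weighted_links, w_star):
    """Keep only the directed links whose weight is at least w_star."""
    return {pair for pair, weight in weighted_links.items() if weight >= w_star}


def scan_thresholds(weighted_links, thresholds):
    """Number of links surviving at each threshold, in the order given."""
    return [len(apply_weight_threshold(weighted_links, w)) for w in thresholds]


# Toy weighted network; the words and weights are made up for illustration.
toy_links = {
    ("SAPPHIRE", "GOLD"): 0.02,
    ("GOLD", "MONEY"): 0.30,
    ("SILVER", "GOLD"): 0.45,
    ("GOLD", "SILVER"): 0.10,
}
print(scan_thresholds(toy_links, [0.0, 0.05, 0.2, 0.5]))
```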
Figure 4. The directed modules of the word "GOLD" at k = 4, w* = 0.023 in the
word association network [37]. The modules are colour coded and the overlaps between
the modules are displayed in red. The size of each node is proportional to the number of
modules it participates in (some of them are not shown in this figure). Beside the name
of the nodes we display their $W^{\alpha}_{i,\mathrm{out}} = w^{\alpha}_{i,\mathrm{out}}/(w^{\alpha}_{i,\mathrm{in}} + w^{\alpha}_{i,\mathrm{out}})$ values as well. Nodes with
high $W^{\alpha}_{i,\mathrm{out}}$ (e.g. "SAPPHIRE") usually correspond to special, rarely used words, whereas
nodes with low relative out-strength (e.g. "MONEY") are very common.

4.1. Word association graph

We examined the directed network obtained from the South Florida Free Association
norms list (containing 10617 nodes and 63788 links), where the weight of a directed link
from one word to another indicates the frequency with which the people in the survey associated
the end point of the link with its start point [37]. For illustration, in Fig. 4 we show the
(colour coded) modules of the word "GOLD" obtained at k = 4 and w* = 0.023, with the
overlaps emphasised in red. According to its different meanings, this word participates
in four strongly internally connected modules. Beside the node labels we display the
relative out-strength of the nodes in the modules using (2b). Apparently, nodes with
a special/particular meaning (e.g. "SAPPHIRE") tend to get high relative out-strength
whereas commonly used words with general meaning (e.g. "MONEY") have low relative
out-strength. Thus, it seems that the overall directionality of the modules is from special
words towards more general words. To make this observation more quantitative, we
measured the number of hits obtained for the different words appearing in the network
using Google's search engine. In Fig. 5 we show the scatter plot of the number of
hits as a function of the relative out-strength of the members of the modules obtained
at the optimal k = 4, w* = 0.016 parameters. The decreasing tendency of the number
of hits with increasing $W^{\alpha}_{i,\mathrm{out}}$ signals that words with higher relative out-strength (i.e.,
having mainly out-neighbours) are usually less frequently used than words with lower
relative out-strength (i.e., having mainly in-neighbours).
Figure 5. The number of hits obtained from Google for module members as a function
of their relative out-strength in the word association network. The number of hits is
decreasing with increasing $W^{\alpha}_{i,\mathrm{out}}$; therefore, frequently used words are likely to obtain
a low relative out-strength.

4.2. Google's web-pages

In addition to being a prominent means of information retrieval, Google provides its
own documents as well: usage notes, feature and product descriptions, etc. The map
of hyper-links among Google's own web-pages offers a unique insight into how one of the
major search portals arranges online content and thereby helps and guides our browsing.
Excluding dynamic content and catalogues, we downloaded with our robot [38] 15,763
web-pages (nodes) and 171,206 directed links among them. For the current analysis,
we excluded international pages and nodes farther than 3 steps from the start node,
http://www.google.com, and obtained a graph with 946 nodes and 1,817 links. Fig. 6
shows three of the many overlapping directed modules identified by the CPMd in this
network at k = 6. Apparently each of the identified overlapping modules in Fig. 6 is a
group of internally densely connected nodes organised around a well-defined topic (jobs,
accounts and enterprise solutions).

An interesting feature of Google's directed modules is that they share their in-hubs,
but not their out-hubs. (By "in-hub" we mean nodes with outstanding in-degree, whereas
"out-hub" stands for nodes with outstanding out-degree.) This structure enhances
browsing efficiency. Having visited a particular, "outlying" page of a module, one can
quickly return to a node in the core of the same module. Then, due to the strong
overlaps among the cores, one can quickly jump over to a new topic, i.e., the web-pages
of another module. In summary, our ability to browse Google's web-pages efficiently
and hierarchically is enhanced by the fact that modules overlap via their in-hubs.
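
Whether the overlaps of a given set of modules are dominated by in-hubs or by out-hubs can be inspected with a sketch along the following lines. The function name `overlap_profile`, the toy page names and the use of whole-network degrees are assumptions of this example.

```python
from collections import Counter


def overlap_profile(modules, links):
    """Map each overlap node to (number of modules, in-degree, out-degree).

    `modules` is a list of node sets, `links` a set of (source, target)
    pairs; degrees are counted in the whole network.
    """
    membership = Counter(node for module in modules for node in module)
    in_degree = Counter(target for _, target in links)
    out_degree = Counter(source for source, _ in links)
    return {
        node: (count, in_degree[node], out_degree[node])
        for node, count in membership.items()
        if count >= 2  # keep only nodes shared by two or more modules
    }


# Toy example: two topical modules that share a heavily linked-to page.
toy_modules = [{"home", "jobs", "apply"}, {"home", "accounts", "login"}]
toy_web = {
    ("jobs", "home"), ("apply", "home"), ("accounts", "home"),
    ("login", "home"), ("apply", "jobs"), ("login", "accounts"),
}
print(overlap_profile(toy_modules, toy_web))
```

In the toy example the only overlap node has four incoming and no outgoing links, i.e., the two modules overlap via an in-hub.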







Figure 6. Three of the overlapping directed modules identified by CPMd in the
directed net of Google's static pages at k = 6. These modules overlap with several
further ones not shown in the figure; the size of each node is proportional to the number
of its modules. The nodes and links of the three modules are coloured brown, green
and blue, while their overlaps, i.e., nodes contained in more than one of these three
modules, are red. The node marked with a + sign at centre is the starting page,
http://www.google.com, and the names of the other nodes are their URLs without
this prefix. The D values of the module members are marked beside the node labels.
Observe that each module contains a number of nodes with many incoming links (a
"core"), some of which are in the overlaps. See text for further details and Fig. 9 for a
detailed analysis of hubs and overlaps.



4.3. Email network

A very common type of directed social network is the one defined by messages and
information flow (directed links) among individuals (nodes). To "measure" such a social
network, Ebel and Bornholdt [39] processed the directed network defined by the emails
of students at the University of Kiel during a period of 112 days. We analysed both
the entire data set and its subset containing only emails between internal addresses
(students). The full network contains 57,158 nodes and 103,701 links, while the 1,267
internal addresses (nodes) are connected by 1,659 links. Fig. 7 shows the directed
modules in these two networks. Observe that overlaps appear even among the relatively
small number of modules of the internal emails, e.g., node 5886 at the centre. In the full
e-mail data set, external addresses have both the highest degrees (number of connections)
and the largest numbers of modules they participate in. In contrast to, e.g., Google's
web-pages, nodes with the largest out-degrees participate in a high number of modules.

[Figure 7 panels: directed modules in the network of a) internal e-mails (a subset of
all messages), b) all e-mails (internal + external).]

Figure 7. All directed modules in a network of student emails at the University of
Kiel during a period of 112 days (data from Ref. [39]). On the left (a) only the graph
of internal emails (between students of the university) is analysed, while on the right
(b) internal and external messages are both included. Circles and boxes show internal
and external email addresses, respectively, and the size of a node is proportional to the
number of its modules. The largest nodes, i.e., those with the highest membership
number, have significantly more outgoing than incoming links, meaning that in this
email network modules share their out-hubs. See also Fig. 9. The optimal k-clique size
parameter values are k = 3 (a) and k = 4 (b).

4.4. The transcriptional regulatory network in yeast

In a cell the transcription of a gene is influenced (regulated) by one or more proteins
called transcription factors. This regulatory relationship is most often represented as
a directed link pointing from the regulating protein (source node) to the protein of
the regulated gene (target node). Recent experimental and computational techniques
[40, 41] have enabled the genome-wide mapping of transcription regulatory relationships
in the yeast, S. cerevisiae.

In Fig. 8 we display the obtained directed modules for k = 3. As an example,
for some of the modules the most significant common functions of their participating 
proteins have been identified from the Gene Ontology protein function annotation 
database [42] with the search tool GO TermFinder [43]. The list of regulatory 
interactions was obtained from Ref. [41]. Most protein modules in Fig. 8 are arranged
around a small number of large out-hubs, the major transcription factors (TFs), each 
of which regulates a large portion of all target genes in the module. Overlaps between 
the modules occur either through the TFs, e.g., via the nodes Met4 and Gcn4 in the
bottom left part of the figure, or via large groups of regulated (target) genes, see, e.g.,
the red nodes at the "interface" between the yellow and brown modules in the upper 
part of the figure. Hence, from the point of view of directed modules, the transcription 
regulation network is organised in a similar way to the e-mail network, and an opposite 
way to Google's web-pages (and the word association network). 

Figure 8. The directed modules of the web of transcription regulatory interactions in
baker's yeast (k = 3). Each node shows one gene (and its protein) and a directed link
stands for a transcription regulatory interaction between a protein and the target gene.
Modules (communities) are coloured and overlaps are red. The overlapping nodes are
mostly out-hubs. Group functions have been identified by GO TermFinder [43].



4.5. Comparison between CPMd and CPM

For each studied network we also located the CPM communities by ignoring the
directionality of the links. In the case of the word association network, where links are
weighted as well, the weight of the undirected counterpart of a double link was defined
as the sum of the corresponding two weights. Due to this difference in the weights as
well as in the definition of modules, the optimal weight threshold was slightly different
in the CPM approach.

Surprisingly, in spite of the restrictions of the CPMd compared to CPM (and, in the
case of the word association network, the difference in link weights), about 70% of the
modules were the same in the two approaches for the word association network and
Google's web pages, whereas this ratio turned out to be even higher (around 90%)
for the email network and the transcription regulatory graph. Furthermore, for the
rest of the directed modules one could find a relatively similar undirected module in
most of the cases. This shows that the original CPM approach to the identification of
modules is quite robust: the restrictions introduced in the CPMd leave the majority of
the undirected modules intact.

Figure 9. The average membership of a node vs. its $d_{\rm out}/(d_{\rm in} + d_{\rm out})$ ratio. This
function is decreasing (increasing) if the modules are more likely to overlap via
in-hubs (out-hubs).



4.6. Classification of real-world networks: modules are connected by in-hubs or
out-hubs

An important aspect of network motifs (overrepresented small sub-graphs with a given 
structure) is that complex networks can be classified based on their motif significance 
profile (a pattern of motif usage). In a somewhat similar approach, here we classify
the four investigated real-world webs into two major groups based on the overlaps of 
their directed modules. 

Interestingly, the way that the out-hubs and in-hubs of the network are arranged
within its directed modules differs among the various types of networks. To directly
compare the studied networks in this respect, in Fig. 9 we show the average number of
modules of the nodes as a function of their relative out-degree,
$D_{i,{\rm out}} = d_{i,{\rm out}} / (d_{i,{\rm in}} + d_{i,{\rm out}})$.
Apparently, the modules in the word association network and Google's web-pages
are connected by in-hubs: nodes contained in a large number of modules have a small
$D_{i,{\rm out}}$. In contrast, in the email network and the transcription regulatory graph of yeast
the overlaps are more likely to contain out-hubs than in-hubs.
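
The comparison described above can be illustrated with a minimal Python sketch. It computes the relative out-degree of every node from a list of directed links and then averages the membership number (number of modules per node) within bins of that ratio. The function names, the binning and the toy data are assumptions made for this illustration; they are not taken from the analysis behind Fig. 9.

```python
from collections import defaultdict

def relative_out_degree(edges):
    """Return {node: d_out / (d_in + d_out)} for directed (source, target) links."""
    d_in, d_out = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        d_out[src] += 1
        d_in[dst] += 1
    nodes = set(d_in) | set(d_out)
    return {v: d_out[v] / (d_in[v] + d_out[v]) for v in nodes}

def mean_membership_by_ratio(edges, modules, n_bins=5):
    """Average number of modules per node, binned by relative out-degree.

    Assumption of this sketch: nodes outside all modules are skipped and
    every module member appears in at least one link.
    """
    ratio = relative_out_degree(edges)
    membership = defaultdict(int)
    for module in modules:
        for v in module:
            membership[v] += 1
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for v, m in membership.items():
        b = min(int(ratio[v] * n_bins), n_bins - 1)  # ratio 1.0 goes to the last bin
        sums[b] += m
        counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# Hypothetical toy network: "hub" only receives links and sits in both modules.
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("d", "hub"), ("a", "b"), ("c", "d")]
modules = [{"a", "b", "hub"}, {"c", "d", "hub"}]
print(mean_membership_by_ratio(edges, modules, n_bins=2))
```

In this toy case the average membership is larger in the low-ratio bin, which is the pattern expected when modules overlap via in-hubs.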

The plausible reason for the observed difference between the investigated networks
is that overlaps contain hubs with increased likelihood in the first place, and the two
kinds of hubs occur in the networks with different probabilities. In the word-association
network and Google's web pages in-hubs are more frequent: the number of words we
associate with a cue word and the number of hyper-links that appear on a web page are
more or less constant, whereas a word with a general meaning or an important (general)
web page can appear as the target of many links. In contrast, we are more likely to find
out-hubs than in-hubs in the email network and the transcription regulatory graph. The
time spent on sending an email does not depend on the number of recipients, whereas
reading a large number of incoming emails can take a lot of time; therefore, being an
in-hub in the email network is disadvantageous and in-hubs are rare. Similarly, in the
case of the transcription regulatory graph the number of transcription factors that can
regulate a given protein is more or less constant, whereas a single transcription factor
can regulate many other proteins in parallel; therefore, out-hubs are much more frequent
than in-hubs.

5. Summary and conclusions 

We examined the directionality of network modules. To compare and order the nodes
in a module, we introduced the relative out-degree, measuring the relative weight of
the out-links of a member to other nodes in the module. We also developed a specific
module finding algorithm for directed networks, based on the k-clique percolation
approach. Even though the CPM can be extended to any kind of directed k-cliques
(containing an arbitrary set of directed links), here we concentrated on the most plausible
choice, which allows a straightforward theoretical and numerical analysis. Following
a simple branching procedure, we have derived the critical point of the directed
k-clique percolation in the ER graph in the large N limit. The theoretical prediction was
confirmed by numerical simulations. We have also studied the directed modular structure
of real-world networks including a word association graph, Google's web pages, an
e-mail network and the transcription regulatory network of yeast. The obtained modules
were validated by additional information (annotations) for the members. The nodes
contained in the overlaps between the modules enabled us to classify the examined
networks into two major groups: the modules in the word association graph and Google's
web pages are likely to be connected by in-hubs, whereas the overlaps in the e-mail
network and the transcription regulatory network are more likely to contain out-hubs.

Appendix A

In this appendix we show that for k-cliques with no double links, the following three
statements are equivalent:

(i) Any directed link in the k-clique points from a node with a higher order (larger
restricted out-degree) to a node with a lower order.

(ii) The k-clique contains no directed loops.

(iii) The restricted out-degree of each node in the k-clique is different.

(The restricted out-degree of a node is equal to the number of its out-neighbours in the
k-clique.)

(ii) → (iii): If loops are absent, then all the members have different restricted out-degrees.

If there are no loops, then there must be a node in the k-clique having only in-neighbours
amongst the other members, since otherwise we could hop from node to node following
directed links inside the k-clique forever (which would mean that it does contain at
least one loop). If we reversed the direction of all links inside the k-clique we would
not induce any loops, and therefore this "reversed" configuration would also have a member
with only incoming links from the others. From this it follows that there must
also be a node in the k-clique with only out-links towards the other nodes. By removing
this node we obtain a (k - 1)-clique in which directed loops are absent. Similarly to the
previous case, this (k - 1)-clique must have a node with only out-neighbours amongst
the other members of the (k - 1)-clique. By removing this node as well, we arrive at a
(k - 2)-clique containing no loops. And so on: by successively removing the node with
only out-neighbours at each step we iterate over all nodes, and the restricted
out-degree of the removed node decreases by one at each step; hence all nodes have
different numbers of out-links inside the k-clique.

(ii) → (i): If loops are absent, then the links point from higher restricted out-degree
values towards lower ones.

The above process showing (ii) → (iii) also reveals that the links inside a k-clique with
no loops always point from a node with a higher restricted out-degree towards a
node with fewer out-links inside the k-clique.

(iii) → (ii): If all nodes have different numbers of out-neighbours inside the k-clique, then
directed loops are absent.

The possible number of out-neighbours a node can have inside a k-clique falls in the range
between 0 and k - 1; therefore, if all nodes have different numbers of out-neighbours,
then all of these possible values must actually appear in the k-clique. Since double links
are absent, the node with k - 1 out-links cannot have any incoming links from the other
members; therefore, it is surely not part of any directed loop inside the k-clique. The
node with k - 2 out-links has only a single incoming link, starting at the node with only
out-links. Therefore, this node cannot be part of any directed loop either. Similarly,
the node with k - 3 out-links has two incoming links, both starting at nodes that have
already been shown to be excluded from any directed loops (the nodes with k - 1 and
k - 2 out-neighbours, respectively). Thus, the node with k - 3 out-neighbours amongst
the other members "inherits" this property (to be excluded from directed loops inside
the k-clique) as well. And so on: by successively scanning the nodes in decreasing order
of their restricted out-degrees, at each step all the incoming links to the node under
investigation come from previously examined members that were shown to be excluded
from loops; therefore, the investigated node cannot be part of any loop either.

(i) → (iii): If each directed link points from a node with a higher restricted out-degree to
a node with a lower one, then the restricted out-degree of each node in the k-clique is
different.

This statement is almost trivial, since if any pair of nodes had the same restricted
out-degree, then the link connecting them would point in the direction of constant restricted
out-degree, contradicting (i).
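
The equivalence of (i), (ii) and (iii) can also be checked by brute force for small k. The following Python sketch (an illustration added here, not part of the original analysis) enumerates every orientation of the complete graph on k nodes, i.e., every k-clique without double links, and tests whether the three statements always agree. All function names are hypothetical.

```python
from itertools import combinations, product

def tournaments(k):
    """Yield every orientation of the complete graph on k nodes (no double links)."""
    pairs = list(combinations(range(k), 2))
    for bits in product((0, 1), repeat=len(pairs)):
        yield {(a, b) if bit else (b, a) for (a, b), bit in zip(pairs, bits)}

def out_degrees(k, links):
    """Restricted out-degree of each node inside the k-clique."""
    deg = [0] * k
    for src, _ in links:
        deg[src] += 1
    return deg

def points_downhill(k, links):
    """Statement (i): every link goes from higher to lower restricted out-degree."""
    deg = out_degrees(k, links)
    return all(deg[s] > deg[t] for s, t in links)

def has_no_loop(k, links):
    """Statement (ii): repeatedly strip nodes without in-links; a loop blocks this."""
    remaining = set(range(k))
    while remaining:
        sources = {v for v in remaining
                   if not any(t == v and s in remaining for s, t in links)}
        if not sources:
            return False
        remaining -= sources
    return True

def all_degrees_differ(k, links):
    """Statement (iii): all restricted out-degrees are different."""
    return len(set(out_degrees(k, links))) == k

def statements_agree(k):
    return all(points_downhill(k, l) == has_no_loop(k, l) == all_degrees_differ(k, l)
               for l in tournaments(k))

print([statements_agree(k) for k in range(2, 6)])
```

The exhaustive check is only feasible for small k, since the number of orientations grows as 2^(k(k-1)/2); it complements the proof rather than replacing it.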

For k-cliques with double links none of the three statements can hold. The
presence of loops is trivial: a double link is already equivalent to a closed directed
path. Furthermore, both constituents of a double link cannot point in the direction of
decreasing order simultaneously. Therefore, we only have to prove that (iii) cannot be
true either, i.e., for k-cliques with double links the members cannot all have different
numbers of out-neighbours amongst the other nodes in the k-clique. The total number
of links, m, inside a k-clique can be written as

$$ m = \sum_{q=0}^{k-1} q \, n_q , \qquad (8) $$

where q runs over the possible numbers of out-neighbours, and $n_q$ is the number of
members with the given restricted out-degree. When all the members have different numbers
of out-links, $n_q = 1$ for all possible q values, and thus $m = k(k-1)/2$, which is exactly
the number of links in a k-clique with no double links. However, in the presence of double
links $m > k(k-1)/2$; therefore, at least one of the $n_q$ values in (8) must be larger
than one, meaning that there are nodes in the k-clique with equal restricted out-degrees.
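
As a small worked instance of this counting argument (an illustrative example added here, with a hypothetical 3-clique), take nodes a, b, c with a double link between a and b and single links from a and from b to c. Counting the links by restricted out-degree gives:

```latex
% Links: a -> b, b -> a, a -> c, b -> c  (one double link, k = 3)
% Restricted out-degrees: a: 2, b: 2, c: 0, hence n_0 = 1, n_1 = 0, n_2 = 2
\[
  m = \sum_{q=0}^{2} q \, n_q = 0 \cdot 1 + 1 \cdot 0 + 2 \cdot 2 = 4
    > \frac{3 \cdot 2}{2} = 3 .
\]
```

Here $n_2 = 2$, so nodes a and b share the same restricted out-degree, as the argument predicts.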

Appendix B 

In this section we briefly describe our algorithm for extracting the CPMd modules in
networks. Since any subgraph of a directed k-clique is a directed k-clique as well (with
a smaller k value), an efficient way to extract the directed k-clique modules of a network
is to find all directed cliques first. A directed clique is a maximal directed k-clique,
i.e., it is not part of an even larger directed k-clique. A CPMd module of a given k is
equivalent to the union of directed cliques of size larger than or equal to k, which can be
reached from each other through overlaps of size larger than or equal to k - 1.
We extracted the directed cliques using the following iteration:

(i) find all directed cliques of a given node, 

(ii) remove the node and its links from the network. 
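
The module definition above (unions of directed cliques of size at least k that are chained together through overlaps of at least k - 1 nodes) can be sketched in Python as follows. The sketch assumes that the directed cliques have already been found and are given as node sets; the function name `cpmd_modules` and the union-find bookkeeping are choices made for this illustration, not a description of the original program.

```python
def cpmd_modules(cliques, k):
    """Merge directed cliques of size >= k that share at least k - 1 nodes.

    `cliques` is assumed to be an iterable of node sets, each one a
    (maximal) directed clique found beforehand.
    """
    big = [frozenset(c) for c in cliques if len(c) >= k]
    parent = list(range(len(big)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Two cliques belong to the same module if their overlap is large enough.
    for i in range(len(big)):
        for j in range(i + 1, len(big)):
            if len(big[i] & big[j]) >= k - 1:
                parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(big):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(groups.values(), key=sorted)

# Hypothetical input: four cliques, analysed at k = 3.
cliques = [{1, 2, 3, 4}, {3, 4, 5}, {5, 6, 7}, {8, 9}]
print(cpmd_modules(cliques, 3))
```

In this toy input the first two cliques share two nodes and merge into one module, the third forms a module of its own, and node 5 belongs to both modules, i.e., it lies in their overlap. The pairwise comparison is quadratic in the number of cliques, which is acceptable for a sketch but would need indexing for large networks.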

To find the directed cliques of a given node, A, we use a backtracking algorithm based
on the hierarchical properties of the directed cliques. At the initial step we construct
two containers, one for the in-neighbours and one for the out-neighbours of A. The
hierarchy of the system at this point is illustrated in Fig. 10b: the in-neighbours are
at the top, the out-neighbours are at the bottom, and the node A itself is in between
them. Next we take a node from the in-neighbours (or the out-neighbours); this node
and A form a directed 2-clique. We place the node above (or below) A, and filter the
remaining nodes in the containers so that for both nodes in the newly formed 2-clique
it is true that

• the members in the containers above the node in the hierarchy are all in-neighbours
of that node,

• the members in the containers below the node in the hierarchy are all out-neighbours
of that node.

If necessary, we may introduce a new container as well, e.g., in Fig. 10b, by picking
node B from the top container, the node E, which is an out-neighbour of B and an
in-neighbour of A, is placed in a container in between B and A in the hierarchy. This
way, when picking the next node from any of the containers, its rank in the hierarchy
inside the forming directed clique coincides with the rank of its container with respect to
the already selected nodes. For example, when picking node C in the example shown in
Fig. 10, it is placed above node B. By recursively picking new nodes from the containers,
filtering the containers and introducing new containers we build up a directed clique.
(The extraction of the clique ends when all containers become empty.)
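
For small graphs, the maximal directed cliques can also be enumerated with a much simpler (and slower) procedure than the container-based search described above. The Python sketch below is such a substitute, not the algorithm of this appendix: it is a Bron-Kerbosch-style recursion that re-checks the directed clique property whenever a node is added. It assumes that a node set is a directed clique if its nodes can be ordered so that every higher-ranked node has a link to every lower-ranked one, which agrees with statement (i) of Appendix A for cliques without double links. All names are hypothetical.

```python
def is_directed_clique(nodes, succ):
    """True if `nodes` can be ranked so that each node links to all lower-ranked ones."""
    remaining = set(nodes)
    while remaining:
        # A valid top node has out-links to every other remaining node.
        top = next((v for v in remaining
                    if remaining - {v} <= succ.get(v, set())), None)
        if top is None:
            return False
        remaining.remove(top)
    return True

def directed_cliques(succ):
    """Enumerate maximal directed cliques of {node: set of out-neighbours}."""
    nodes = set(succ) | {t for targets in succ.values() for t in targets}
    found = []

    def expand(clique, candidates, excluded):
        # candidates/excluded: nodes that still extend `clique` to a directed clique;
        # `excluded` ones were already handled in an earlier branch.
        if not candidates and not excluded:
            found.append(set(clique))
            return
        for v in sorted(candidates):
            grown = clique | {v}
            expand(grown,
                   {w for w in candidates
                    if w != v and is_directed_clique(grown | {w}, succ)},
                   {w for w in excluded if is_directed_clique(grown | {w}, succ)})
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(frozenset(), nodes, set())
    return found

# Hypothetical example: a loop-free triangle {1, 2, 3} attached to a directed loop 3 -> 4 -> 5 -> 3.
succ = {1: {2, 3}, 2: {3}, 3: {4}, 4: {5}, 5: {3}}
print(directed_cliques(succ))
```

In the example the triangle {3, 4, 5} forms a directed loop, so only its three links appear as (maximal) directed 2-cliques, while {1, 2, 3} is reported as a directed 3-clique. The recursion relies only on the fact that any subset of a directed clique is again a directed clique; its running time is exponential in the worst case.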




Figure 10. Illustration of the directed clique search. a) The neighbourhood of node A
in a hypothetical directed network. b) The initial state of the directed clique extraction
algorithm: the in-neighbours of A are above A, whereas its out-neighbours are below
it. c) Node B is picked from the in-neighbours and is placed above A. Nodes D and F
are not neighbours of B, therefore they are removed from the containers. Furthermore,
a new container is introduced holding node E, which is in between B and A in the
hierarchy. d) Node C is picked from the top container and is placed above B; node H
is removed from the bottom container as it is not linked to C.

Our algorithm scales similarly to the original CPM (see the Supplementary
Information of Ref. [19]). Since the determination of the full set of cliques of a graph is
widely believed to be a non-polynomial problem, the extraction of the directed cliques is
non-polynomial as well. In spite of this, in real networks our algorithm proves to be quite
efficient. Our experience shows that the required CPU time depends very strongly on the
structure of the input data; therefore, in general no closed formula can be given even
to estimate the system size dependence. As an illustration of the computational speed,
however, we note that a complete analysis of the word-association network with over
70,000 links takes less than 5 minutes on a PC. By extracting the directed modules of
this system at different link-weight thresholds, the time dependence of the algorithm
could be fitted with $t = A M^{B \ln(M)}$, where t denotes the time needed by our program,
M stands for the number of edges, and A and B are fitting parameters.
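
A fit of this form can be obtained by ordinary linear regression, assuming the fitting function is $t = A M^{B \ln(M)}$: taking logarithms gives $\ln t = \ln A + B (\ln M)^2$, which is linear in $(\ln M)^2$. The Python sketch below illustrates this; the function name and the sample values are hypothetical and synthetic, not measured running times.

```python
import math

def fit_runtime(samples):
    """Least-squares fit of ln t = ln A + B (ln M)^2 to (M, t) pairs; returns (A, B)."""
    xs = [math.log(m) ** 2 for m, _ in samples]
    ys = [math.log(t) for _, t in samples]
    n = len(samples)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return math.exp(mean_y - slope * mean_x), slope

# Synthetic data generated from assumed parameters, only to exercise the fit.
A_true, B_true = 1e-3, 0.05
samples = [(m, A_true * m ** (B_true * math.log(m))) for m in (1000, 5000, 20000, 70000)]
A_fit, B_fit = fit_runtime(samples)
print(A_fit, B_fit)
```

With noise-free synthetic input the fit recovers the assumed parameters up to rounding error; with real timing data the residuals would indicate how well the functional form describes the measurements.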